What We'll Cover
In Week 1, we introduced transformers as "attention machines" and briefly explored how they differ from earlier architectures. This session goes deeper: we'll return to the transformer architecture that powers models like GPT, Claude, and Gemini, look at how attention mechanisms actually work, and explore recent innovations like Mixture of Experts that make modern models more efficient.
By the end of this session you should have a greater appreciation of how LLMs work—and why architectural choices matter for research applications.
Some of this may feel uncomfortably mathematical. If it does, it is fine to aim for the overall gist of what is going on.
🧩 Transformer Architecture Fundamentals
Let's revisit the transformer architecture with more technical depth. While Week 1 gave us the intuition, here we'll understand the actual mechanisms.
The lecture series below is in addition to last week's videos. It is also a little technical, but it explains the architecture slightly differently from 3Blue1Brown.
📹 Lecture series on the transformer architecture (watch the 3Blue1Brown videos first).
The Core Components
- Token embeddings: Converting discrete tokens into continuous vector representations
- Positional encoding: Injecting information about token position in the sequence
- Attention layers: The heart of the model—learning which tokens matter
- Feed-forward networks: Processing attended information within each layer
- Layer normalization: Stabilizing training across deep networks
- Residual connections: Allowing gradients to flow through many layers
Decoder-Only Architecture
Modern LLMs (GPT, Claude, LLaMA) use decoder-only architectures rather than the original encoder-decoder design.
- Autoregressive generation: Predict one token at a time, left to right
- Causal masking: Each token can only attend to itself and previous tokens
- Simpler architecture: No cross-attention needed
- Unified pre-training: Single objective (next-token prediction) for all training
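Causal masking is easiest to see as a grid. The sketch below is a minimal illustration in plain Python (the function name `causal_mask` is my own, not from any library): each row is the token doing the attending, and an `x` marks every position it is allowed to see.

```python
def causal_mask(seq_len):
    """Return a seq_len x seq_len grid: True where attention is allowed."""
    # Row i is the token doing the attending; column j is the token attended to.
    # Token i may look at itself and at earlier tokens only (j <= i).
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


mask = causal_mask(4)
for row in mask:
    print(" ".join("x" if allowed else "." for allowed in row))
```

The result is a lower-triangular pattern: the first token sees only itself, the last token sees everything before it.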
💡 Why decoder-only?
Decoder-only models proved more scalable and effective for general-purpose language understanding and generation. The encoder-decoder design is still used for specific tasks like translation.
📐 The Forward Pass: Input to Output
Here's what happens when you send text to an LLM:
- Tokenization: Text → token IDs (e.g., "The cat" → [464, 3797])
- Embedding lookup: Each token ID → dense vector (e.g., 4096 dimensions)
- Positional encoding: Add position information to each token vector
- Transformer layers (repeated N times):
  - Multi-head self-attention: tokens attend to previous context
  - Feed-forward network: process attended representations
  - Residual connections + layer norm after each sub-layer
- Final layer norm: Normalize output representations
- Output projection: Map to vocabulary size (e.g., 50,000 tokens)
- Sampling: Choose next token based on probability distribution
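The last two steps (output projection and sampling) can be sketched in a few lines. This is a toy illustration only: the five-word vocabulary and the logit values are made up, and real models do this over tens of thousands of tokens.

```python
import math
import random


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical vocabulary and made-up logits, standing in for the output projection.
vocab = ["mat", "sofa", "moon", "dog", "the"]
logits = [3.2, 2.1, 0.3, 1.0, -1.5]

probs = softmax(logits)
rng = random.Random(0)  # fixed seed so the run is repeatable
next_token = rng.choices(vocab, weights=probs, k=1)[0]

for word, p in zip(vocab, probs):
    print(f"{word:>5}: {p:.3f}")
print("sampled:", next_token)
```

Because the next token is drawn from a probability distribution rather than always taking the top score, the same prompt can produce different continuations.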
Here is the original paper itself, in case you are feeling particularly brave: "Attention Is All You Need" (Vaswani et al., 2017)
👁️ Attention Mechanisms in Detail
Attention is the fundamental innovation that makes transformers work. Let's understand the different types and how they've evolved.
🔑 The Attention Intuition
When you read "The animal didn't cross the street because it was too tired," your brain automatically knows "it" refers to "the animal," not "the street." You attend to relevant context.
Self-attention mechanisms let transformers do the same thing: for each token, learn which other tokens in the context are relevant, and weight them accordingly when building representations.
Self-Attention
The core mechanism: each token computes attention scores with every other token in the sequence.
- Query, Key, Value: Each token produces three vectors through learned projections
- Attention scores: Dot product of Query with all Keys (how relevant is each token?)
- Softmax normalization: Convert scores to probability distribution
- Weighted sum: Combine Values using attention weights
🔍 Scaled Dot-Product
Attention scores are divided by √d_k (the square root of the key dimension). Without this scaling, dot products grow with the dimension and push softmax into saturated regions where gradients become extremely small.
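The four steps above, plus the scaling, fit in one short function. This is a minimal sketch in plain Python rather than an optimized implementation: the Q, K and V vectors are made-up numbers supplied directly (the learned projections that would normally produce them are left out), and the function names are my own.

```python
import math


def softmax(scores):
    """Convert raw scores into weights that sum to 1."""
    top = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values, causal=True):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d_k = len(keys[0])
    outputs = []
    for i, q in enumerate(queries):
        # With causal masking, token i sees only positions 0..i.
        visible = i + 1 if causal else len(keys)
        # Attention scores: query . key, divided by sqrt(d_k).
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k)
            for k in keys[:visible]
        ]
        weights = softmax(scores)
        # Weighted sum of the visible value vectors.
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values[:visible]))
            for j in range(len(values[0]))
        ])
    return outputs


# Three toy tokens with made-up 2-dimensional Q, K and V vectors.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

for row in attention(Q, K, V):
    print([round(x, 3) for x in row])
```

The first output row is simply the first value vector, because with causal masking the first token has nothing else to attend to.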
Multi-Head Attention
Instead of one attention mechanism, use many in parallel—each learns different patterns.
- Multiple heads: 32-96 attention heads in modern LLMs
- Specialized patterns: Each head might learn syntax, semantics, or other relationships
- Concatenation: Outputs from all heads combined and projected
- Richer representations: Capture multiple types of dependencies simultaneously
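To make "many heads in parallel, then concatenate" concrete, here is a small sketch in plain Python. It is a simplification under stated assumptions: the heads are formed by slicing the input vectors directly, whereas a real model applies learned Q, K, V projections first and a learned output projection after concatenation. All function names are illustrative.

```python
import math


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(q_rows, k_rows, v_rows):
    """Causal scaled dot-product attention for a single head."""
    scale = math.sqrt(len(k_rows[0]))
    out = []
    for i, q in enumerate(q_rows):
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in k_rows[: i + 1]]
        weights = softmax(scores)
        # zip stops at len(weights), so only visible positions contribute.
        out.append([
            sum(w * v[j] for w, v in zip(weights, v_rows))
            for j in range(len(v_rows[0]))
        ])
    return out


def split_heads(rows, n_heads):
    """Cut each d_model-sized vector into n_heads equal slices."""
    head_dim = len(rows[0]) // n_heads
    return [
        [row[h * head_dim:(h + 1) * head_dim] for row in rows]
        for h in range(n_heads)
    ]


def multi_head(q, k, v, n_heads):
    """Run attention separately per head, then concatenate per token."""
    heads = [
        attend(qh, kh, vh)
        for qh, kh, vh in zip(split_heads(q, n_heads),
                              split_heads(k, n_heads),
                              split_heads(v, n_heads))
    ]
    return [[x for head in heads for x in head[t]] for t in range(len(q))]


# Two toy tokens with 4-dimensional vectors, split into 2 heads of size 2.
x = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]]
result = multi_head(x, x, x, n_heads=2)
for row in result:
    print([round(val, 3) for val in row])
```

Each head works on its own slice and computes its own attention weights, which is what lets different heads pick up different relationships.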
Cross-Attention
Used in encoder-decoder models (and multimodal systems): attend to a different sequence.
- Query from decoder: "What am I trying to generate?"
- Keys/Values from encoder: "What input information is available?"
- Translation example: Decoder attends to source language while generating target
- Vision-language models: Text decoder attends to image features
📹 I don't think that you can do much better than 3Blue1Brown to understand attention
⚡ Modern Attention Variants
Standard multi-head attention is computationally expensive. Modern models use optimized variants:
| Variant | Key Idea | Benefit | Used In |
|---|---|---|---|
| Multi-Head Attention (MHA) | Each head has its own Q, K, V projections | Rich representations | Original transformers, GPT-3 |
| Multi-Query Attention (MQA) | Share K, V across heads; unique Q per head | Faster inference, less memory | PaLM, Falcon |
| Grouped-Query Attention (GQA) | Multiple heads share K, V in groups | Balance between MHA and MQA | Llama 2, Mistral, GPT-4 (rumored) |
Research implication: GQA has become the dominant choice for new models—it provides most of MHA's quality with much better inference efficiency.
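Much of the memory saving in the table comes from the key/value cache that a model keeps during generation: fewer K/V heads means less to store. The arithmetic below is a rough sketch for a hypothetical model (32 layers, 32 query heads of size 128, a 4,096-token context, 16-bit values); none of these settings describe a specific real model.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Memory needed to cache keys and values for one sequence."""
    # Factor of 2: one cached tensor for keys, one for values.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


# Number of K/V heads under each variant, for a model with 32 query heads.
variants = {"MHA": 32, "GQA (8 groups)": 8, "MQA": 1}
for name, n_kv_heads in variants.items():
    size = kv_cache_bytes(n_layers=32, n_kv_heads=n_kv_heads, head_dim=128, seq_len=4096)
    print(f"{name:<15} {size / 2**20:>6.0f} MiB")
```

Under these assumed settings the cache shrinks from 2 GiB (MHA) to 512 MiB (GQA with 8 groups) to 64 MiB (MQA), per sequence.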
📍 Positional Encoding: Teaching Position
Transformers process all tokens in parallel (unlike RNNs which are sequential). But word order matters! Positional encoding solves this problem.
Absolute Positional Encoding
Original approach: add a position-specific vector to each token embedding.
- Sinusoidal encoding: Original transformer used sin/cos functions of different frequencies
- Learned positions: Some models learn position embeddings during training (like GPT)
- Fixed context: Limited to maximum sequence length seen during training
- Issue: Doesn't extrapolate well to longer sequences
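The sinusoidal scheme from the original transformer paper can be written in a few lines. This sketch follows the paper's formula (sine on even dimensions, cosine on odd ones, with wavelengths set by the constant 10000) and assumes `d_model` is even; the function name is my own.

```python
import math


def sinusoidal_encoding(position, d_model):
    """Position vector: sin on even dimensions, cos on odd dimensions."""
    vec = []
    for i in range(0, d_model, 2):
        # Lower dimensions oscillate quickly, higher ones slowly.
        angle = position / (10000 ** (i / d_model))
        vec.append(math.sin(angle))
        vec.append(math.cos(angle))
    return vec


for pos in range(3):
    print(pos, [round(x, 3) for x in sinusoidal_encoding(pos, d_model=4)])
```

Each position gets a distinct vector, which is then added to the token embedding at that position.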
Relative Positional Encoding
Modern approach: encode the distance between tokens rather than absolute position.
- RoPE (Rotary Position Embedding): Rotate Q and K vectors based on position—used in LLaMA, Mistral, GPT-NeoX
- ALiBi (Attention with Linear Biases): Add bias to attention scores based on distance—used in BLOOM, MPT
- Length extrapolation: Can handle sequences longer than training context
- Better generalization: Understands "distance" concept rather than memorizing positions
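ALiBi is the simpler of the two to sketch: it adds a penalty to each attention score that grows with the distance between the two tokens. The slope of 0.5 below is an arbitrary value for illustration; real ALiBi assigns a different fixed slope to each head.

```python
def alibi_bias(seq_len, slope):
    """Bias added to causal attention scores: 0 for self, more negative with distance."""
    # Row i is the attending token; column j <= i is the token attended to.
    return [[slope * (j - i) for j in range(i + 1)] for i in range(seq_len)]


for row in alibi_bias(4, slope=0.5):
    print(row)
```

Because the bias depends only on the distance `i - j`, nothing in it is tied to a particular absolute position.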
💡 Why RoPE Dominates
RoPE (Rotary Position Embedding) has become the standard for new LLMs because it:
- Encodes relative positions naturally through rotation in complex space
- Supports extension to contexts longer than those seen in training, usually with a scaling technique such as position interpolation
- Maintains computational efficiency (applied during Q/K computation)
- Empirically outperforms alternatives on long-context tasks
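The claim that rotation encodes relative position can be checked numerically. The sketch below is a two-dimensional toy version, assuming a single arbitrary rotation rate `theta` of 0.5; real RoPE applies the same idea to many pairs of dimensions, each with its own frequency. The vectors and positions are made up.

```python
import math


def rotate(vec, position, theta=0.5):
    """Rotate a 2-D vector by an angle proportional to its position."""
    x, y = vec
    angle = position * theta
    return (x * math.cos(angle) - y * math.sin(angle),
            x * math.sin(angle) + y * math.cos(angle))


def score(q, k, q_pos, k_pos):
    """Dot product of the rotated query and rotated key."""
    rq, rk = rotate(q, q_pos), rotate(k, k_pos)
    return rq[0] * rk[0] + rq[1] * rk[1]


q, k = (1.0, 2.0), (0.5, -1.0)
near = score(q, k, q_pos=5, k_pos=3)        # two positions apart, early in the text
far = score(q, k, q_pos=105, k_pos=103)     # two positions apart, much later
other = score(q, k, q_pos=5, k_pos=4)       # one position apart

print(round(near, 6), round(far, 6), round(other, 6))
```

The first two scores agree (up to floating-point rounding) because only the gap between positions matters; the third differs because the gap is different.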
📄 Again, this is a little maths-heavy, but see if you can use it, along with ChatGPT, to understand positional encodings:
Positional Embeddings in Transformers: A Math Guide to RoPE & ALiBi
🔀 Modern Architectural Innovations
The frontier of LLM architecture isn't just about making models bigger—it's about making them smarter. Here are the key innovations driving 2024-2026 models.
💡 The Efficiency Revolution
Modern LLM research focuses on parameter efficiency: getting better performance without proportionally increasing compute costs. The key insight: not every parameter needs to activate for every input.
This shift—from "bigger is better" to "smarter is better"—is driven by Mixture of Experts architectures, attention optimizations, and clever training techniques.
🎯 Mixture of Experts (MoE)
MoE is perhaps the most important architectural innovation in modern LLMs. Instead of using all parameters for every token, route each token to a subset of specialized "expert" networks.
How MoE Works:
- Expert networks: Instead of one feed-forward network per layer, have many expert FFNs (8 in Mixtral, 256 in DeepSeek-V3)
- Router network: Small learned network decides which experts process each token
- Sparse activation: Only top-K experts (typically 2-8) activate per token
- Load balancing: Ensure tokens distribute roughly evenly across experts
- Combination: Outputs from active experts weighted and summed
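The routing steps above can be sketched for a single token. This is a toy illustration under stated assumptions: the "experts" are trivial functions that scale their input, the router logits are made up, and the weights are a softmax over only the selected experts (one common choice; models differ on this detail). Load balancing is a training-time concern and is not shown.

```python
import math


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route(router_logits, top_k=2):
    """Pick the top_k experts for one token and normalise their weights."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda e: router_logits[e], reverse=True)
    chosen = ranked[:top_k]
    weights = softmax([router_logits[e] for e in chosen])
    return list(zip(chosen, weights))


def moe_layer(token, experts, router_logits, top_k=2):
    """Run only the chosen experts and sum their weighted outputs."""
    out = [0.0] * len(token)
    for expert_id, weight in route(router_logits, top_k):
        expert_out = experts[expert_id](token)
        out = [o + weight * y for o, y in zip(out, expert_out)]
    return out


# Four hypothetical experts; each one just multiplies the token by a constant.
experts = [lambda v, s=s: [s * x for x in v] for s in (1.0, 2.0, 3.0, 4.0)]
router_logits = [0.1, 2.0, -1.0, 2.0]  # made-up router scores for one token

print(route(router_logits))
print(moe_layer([1.0, -1.0], experts, router_logits))
```

Only two of the four experts run for this token, which is where the compute saving comes from.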
MoE Benefits
- Massive parameter count: 100B+ total parameters with only 10-20B active per token
- Inference efficiency: Computational cost based on active parameters, not total
- Specialization: Different experts can learn different domains/patterns
- Scaling: Add more experts without proportional compute increase
MoE Challenges
- Training complexity: Load balancing is tricky—some experts might be underutilized
- Memory requirements: All experts must fit in GPU memory even if only few are active
- Communication overhead: Routing adds latency in distributed systems
- Instability: Careful tuning needed to prevent expert collapse
📊 MoE in Production Models
Mixtral 8x7B: 8 experts, 2 active per token → 47B total params, 13B active → reported to match or outperform Llama 2 70B on most benchmarks, at roughly the per-token compute of a 13B dense model
DeepSeek-V3: 256 experts, 8 active per token → 671B total params, 37B active → competitive with GPT-4 at a fraction of the inference cost
GPT-4 (rumored): Speculated to use MoE with 8-16 experts, explaining its size vs. inference speed
📹 Mixture of Experts explanation
📄 DeepSeek-V3 Github repo
Here is an open-weights Mixture of Experts model that you can, in theory, download and run (though I wouldn't do this on your laptop).
📏 Understanding Model Scale
When we say "GPT-4 has 1.76 trillion parameters" (a widely-circulated but unconfirmed third-party estimate, never officially disclosed by OpenAI) or "LLaMA 3 70B," what do these numbers actually mean? And is bigger always better?
Parameter Count Breakdown
Parameters are the learned weights in the neural network. They're distributed across:
- Embedding layers: Token + position embeddings (vocab_size × embedding_dim)
- Attention layers: Q, K, V, output projections for each head, each layer
- Feed-forward networks: Two linear layers per transformer layer (typically 4× hidden size)
- Layer norms: Small contribution (scale + shift per layer)
🔢 Quick Math
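A 7B parameter model might have 32 layers with hidden size d = 4096. Each layer has about 4d² attention weights (Q, K, V, and output projections) plus 8d² feed-forward weights (two matrices with 4× expansion), so 32 × 12 × 4096² ≈ 6.4 billion. Token embeddings and the output projection add a few hundred million more (about 0.26 billion for a 32,000-token vocabulary), giving roughly 7 billion parameters.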
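A back-of-the-envelope estimate like this is easy to reproduce in code. The sketch below assumes a plain dense transformer with 32 layers, hidden size 4096, a 4× feed-forward expansion, and a 32,000-token vocabulary with separate input and output embedding matrices. It ignores layer norms and biases, which contribute very little. The function name is illustrative.

```python
def dense_param_estimate(n_layers, d_model, vocab_size, ffn_mult=4):
    """Rough parameter count for a dense decoder-only transformer."""
    attention = 4 * d_model * d_model           # Q, K, V and output projections
    ffn = 2 * d_model * (ffn_mult * d_model)    # up-projection and down-projection
    embeddings = 2 * vocab_size * d_model       # input embeddings + output projection
    return n_layers * (attention + ffn) + embeddings


total = dense_param_estimate(n_layers=32, d_model=4096, vocab_size=32000)
print(f"{total / 1e9:.2f}B parameters")
```

Almost all of the parameters sit in the attention and feed-forward matrices; the embeddings are a small fraction at this scale.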
Parameters vs. Active Parameters
For MoE models, these are different concepts:
- Total parameters: All weights in all experts (what's reported as "model size")
- Active parameters: Weights actually used to process a single token
- Example: Mixtral 8x7B has 47B total params but only 13B active per token
- Inference cost: Determined by active parameters, not total
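The gap between total and active parameters can be reproduced with the same kind of arithmetic. The configuration values below are as I recall them from the Mixtral paper (Jiang et al., 2024), so treat them as assumptions rather than an official specification; the estimate also ignores layer norms.

```python
def mixtral_style_params(n_layers=32, d_model=4096, ffn_hidden=14336,
                         n_heads=32, n_kv_heads=8, head_dim=128,
                         vocab_size=32000, n_experts=8, top_k=2):
    """Rough total and per-token active parameter counts for an MoE model."""
    # Grouped-query attention: Q and output are full size, K and V are smaller.
    attn = 2 * d_model * n_heads * head_dim + 2 * d_model * n_kv_heads * head_dim
    expert = 3 * d_model * ffn_hidden       # gated FFN with three weight matrices
    router = d_model * n_experts            # small linear layer scoring the experts
    embeddings = 2 * vocab_size * d_model   # input embeddings + output projection
    total = n_layers * (attn + n_experts * expert + router) + embeddings
    active = n_layers * (attn + top_k * expert + router) + embeddings
    return total, active


total, active = mixtral_style_params()
print(f"total:  {total / 1e9:.1f}B")
print(f"active: {active / 1e9:.1f}B")
```

Under these assumptions the estimate lands close to the 47B total and 13B active figures quoted above. Only the expert FFNs are sparse: attention and embeddings are used for every token.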
Dense vs. Sparse Architectures
- Dense models: All parameters active for every input (GPT-3, LLaMA 2)
- Sparse models (MoE): Subset of parameters active per input (Mixtral, DeepSeek-V3, Switch Transformer)
- Trade-off: Sparse models offer better parameter efficiency but increased training complexity
- Trend: Major labs moving toward sparse architectures for flagship models
🎯 When Bigger ≠ Better
The relationship between model size and performance is nuanced:
| Scenario | Smaller Model Wins | Larger Model Wins |
|---|---|---|
| Latency-sensitive applications | ✓ Faster inference, lower latency | ✗ Slower, requires more compute |
| Resource-constrained deployment | ✓ Runs on smaller GPUs, edge devices | ✗ Needs high-end infrastructure |
| Narrow domain tasks | ✓ Can be fine-tuned effectively | Diminishing returns |
| Complex reasoning | Limited capability | ✓ Better at multi-step problems |
| Rare/specialized knowledge | Likely to hallucinate | ✓ More knowledge encoded in parameters |
| Few-shot learning | Requires more examples | ✓ Better in-context learning |
Research principle: Match model size to your task. A well-trained 7B model often outperforms a poorly-prompted 70B model. And for many research tasks (data analysis, writing assistance, literature review), mid-size models are sufficient.
🔮 The Current Frontier (Feb 2026)
A caveat first: The sizes and architectures of proprietary frontier models are not officially disclosed by their makers. Every parameter figure below for a closed model is a third-party estimate — informed guesses, not facts.
Claude Opus (4.x): Anthropic does not publish Claude's architecture or parameter count. Stanford's Foundation Model Transparency Index records both as “not disclosed”. Where people speculate, the guess leans dense (every parameter active per token) rather than Mixture-of-Experts — but this is unconfirmed, and no specific size should be cited.
Mixture-of-Experts, the documented cases: The clearest MoE examples are the open ones. DeepSeek-V3 is a 671B-total / 37B-activated MoE (DeepSeek-AI, 2024), and Mixtral routes each token through 2 of 8 experts — 47B total but only ~13B active per token (Jiang et al., 2024). Because these are published, the figures are verifiable.
GPT-4 (reported ~1.8T, MoE): Per an unofficial industry report by SemiAnalysis (Patel & Wong, July 2023), GPT-4 is ~1.8 trillion parameters using 16 experts — a leak, not an OpenAI disclosure.
GPT-4o (estimated ~200B): Third-party estimates put GPT-4o roughly an order of magnitude smaller than GPT-4. A Microsoft / Univ. of Washington benchmark paper lists GPT-4o at ~200B and GPT-4 at ~1.76T (Ben Abacha et al., 2025), explicitly noting these are mined from public articles and unverified; Epoch AI independently estimates ~200B (and warns it “could easily be off by a factor of 2”).
Small models: Open models in the 7B–13B range (e.g. Llama, Mistral) are genuinely disclosed and run on modest hardware — the figures here are real, not estimated.
The trend: The frontier has shifted from chasing raw parameter count toward architectural efficiency — sparse MoE routing, distillation into smaller models, and test-time (reasoning) compute. Bigger is no longer automatically better, and the most reliable numbers are the ones the makers actually publish.
📄 Scaling Laws for LLMs
Understanding the current state of LLM scaling and the future of AI research
🔬 Try this yourself (about 10 minutes)
This page has been conceptual; here is a hands-on way to see tokenisation, which underpins everything above. Open a free tokeniser playground — for example the OpenAI tokenizer — and paste in two or three sentences from your own field, deliberately including the messy bits: a piece of jargon, a chemical formula or equation, an author's name, a word in another language.
Then look at what happened:
- How many tokens did your text become? Common English words are usually one token; rare or technical terms often shatter into several.
- Where did your specialist vocabulary fragment most? That fragmentation is part of why models sometimes mishandle domain-specific terms: the model never sees your word as a single unit.
- Now imagine a whole paper at this token rate. That total is what has to fit inside the context window, the limit this lesson's positional-encoding section is ultimately about.
It is a small exercise, but it converts “tokens” from an abstraction into something you have watched happen to your own writing.
📚 Summary & Key Takeaways
You now understand the technical architecture of modern LLMs:
- Transformer fundamentals: Decoder-only architecture, layer structure, residual connections
- Attention mechanisms: Self-attention, multi-head variants (MHA/MQA/GQA), how tokens learn relevance
- Positional encoding: RoPE and ALiBi enable models to understand sequence order and extrapolate length
- Mixture of Experts: Sparse activation allows massive models with efficient inference
- Scale considerations: Bigger isn't always better—match model to task, consider active vs. total parameters
Next session (Week 2.2): We'll explore how these architectures are actually trained—the pre-training process, optimization techniques, and the computational resources required to create LLMs from scratch.